Towards Reliable Large Audio Language Model

Improving Reliability and Trustworthiness in Universal Audio Understanding via LALMs

Published: May 25, 2025

Authors: Z. Ma et al.
Link: http://arxiv.org/abs/2505.19294v1
Institutions: X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University • ByteDance • Shanghai Innovation Institute
Keywords: Large Audio Language Model, LALM, Reliability, Reliability Gain Index, RGI, Supervised Fine-Tuning, SFT, LoRA, Qwen2-Audio, IDK Prompting, Multi-Modal Chain-of-Thought, MCoT, Audio Understanding, Speech, Music, Sound, Trustworthiness, Humbleness, Truthfulness, Conservativeness, Transfer Learning, Benchmark, Evaluation Metrics


Large Audio Language Models (LALMs) have recently emerged as universal tools for audio understanding and reasoning across modalities including speech, music, and general sound. Despite notable advances, these models exhibit a critical weakness: they cannot recognize their own knowledge boundaries and fail to refuse questions that exceed their capabilities. This shortcoming undermines reliability in practical scenarios such as healthcare, autonomous driving, and interactive agents. Moreover, conventional evaluation metrics do not adequately assess how well LALMs balance providing answers (being helpful) against refraining from guessing (being humble).
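To make the helpfulness-vs-humbleness tension concrete, here is a minimal illustrative sketch (not the paper's exact metric) of a score that rewards correct answers, penalizes wrong guesses, and treats refusals as neutral, so a model that guesses blindly cannot outscore one that refuses what it does not know:

```python
# Illustrative reliability-style score (hypothetical, not the paper's RGI):
# 'correct' earns +1, 'wrong' costs a penalty, 'refused' contributes 0.
def reliability_score(outcomes, wrong_penalty=1.0):
    """outcomes: list of 'correct', 'wrong', or 'refused' per question."""
    if not outcomes:
        return 0.0
    score = 0.0
    for o in outcomes:
        if o == "correct":
            score += 1.0
        elif o == "wrong":
            score -= wrong_penalty
        # 'refused' adds nothing: humble, but not helpful
    return score / len(outcomes)

# A model that refuses unanswerable questions beats one that always guesses:
humble  = ["correct", "correct", "refused", "refused"]
guesser = ["correct", "correct", "wrong", "wrong"]
print(reliability_score(humble))   # 0.5
print(reliability_score(guesser))  # 0.0
```

Plain accuracy would score both models identically; penalizing wrong guesses is what makes refusal a rational strategy, which is the behavior the paper's evaluation aims to measure.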

To address these challenges, the authors propose and systematically evaluate methods for enhancing the reliability of LALMs, spanning training-free approaches, such as "I don't know" (IDK) prompting and multi-modal chain-of-thought (MCoT) prompting, and training-based approaches, such as supervised fine-tuning (SFT) with LoRA on Qwen2-Audio.
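As a sketch of the training-free IDK idea, a prompt template can explicitly license refusal; the wording below is illustrative, not the paper's actual template:

```python
# Hypothetical IDK prompting sketch: prepend an instruction that permits the
# model to refuse rather than guess. The exact phrasing is an assumption.
IDK_INSTRUCTION = (
    "Listen to the audio and answer the question. "
    "If the answer cannot be determined from the audio, reply exactly "
    "\"I don't know\" instead of guessing."
)

def build_idk_prompt(question: str) -> str:
    """Combine the refusal-permitting instruction with a user question."""
    return f"{IDK_INSTRUCTION}\nQuestion: {question}"

prompt = build_idk_prompt("What instrument is playing?")
print(prompt)
```

The resulting text would be passed, together with the audio, to the LALM; because nothing about the model is retrained, the method applies to any instruction-following audio model.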

The authors validate their approach with detailed experiments, and the results reveal several important findings.

In conclusion, the study initiates the exploration of reliability in large audio language models, contributing effective mitigation strategies together with a dedicated evaluation metric, the Reliability Gain Index (RGI).